Session F1

Data Processing

Conference
10:35 AM — 11:55 AM HKT
Local
Dec 2 Wed, 6:35 PM — 7:55 PM PST

A Social Link Based Private Storage Cloud

Michalis Konstantopoulos, Nikos Chondros and Mema Roussopoulos

In this paper, we present O^3, a social-link-based private storage cloud for decentralized collaboration. O^3 allows users to share and collaborate on collections of files (shared folders) with others, in a decentralized manner, without the need for intermediaries (such as public cloud storage servers) to intervene. Thus, users can keep their working relationships (who they work with) and what they work on private from third parties. Using benchmarks and traces from real workloads, we experimentally evaluate O^3 and demonstrate that the system scales linearly when synchronizing increasing numbers of concurrent users, while performing on par with the ext4 non-version-tracking filesystem.

Enabling Generic Verifiable Aggregate Query on Blockchain Systems

Yanchao Zhu, Zhao Zhang, Cheqing Jin, and Aoying Zhou

Currently, users in a blockchain system must maintain all the data on the blockchain and query the data locally to ensure the integrity of the query results. However, since data is updated in an append-only manner, the amount of data grows continuously, imposing considerable maintenance costs on users. In this paper, we present an approach to support verifiable aggregate queries on blockchain systems that alleviates both storage and computing costs for users, while ensuring the integrity of the query results. We design an accumulator-based authenticated data structure (ADS) that supports verifiable multidimensional aggregate queries (i.e., aggregate queries with multiple selection predicates). The structure is built for each block, and based on it, verifiable multidimensional aggregate queries within a single block or spanning multiple blocks are supported. We further optimize performance by merging the ADSs of different blocks, which reduces both the verification time at the client side and the verification object (VO) size. Extensive experiments demonstrate the effectiveness and efficiency of our proposed approach.
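The paper's ADS is accumulator-based; that construction is not reproduced here. As a rough, hypothetical illustration of the general idea (a client verifying an aggregate against a per-block digest using a compact verification object), here is a minimal Merkle sum tree sketch in Python; all names and the tree structure are ours, not the authors':

```python
import hashlib

def H(*parts) -> bytes:
    """Hash ints/bytes into one digest."""
    m = hashlib.sha256()
    for p in parts:
        m.update(p if isinstance(p, bytes) else str(p).encode())
    return m.digest()

def build(vals, lo, hi):
    """(sum, digest) of the subtree over vals[lo:hi]."""
    if hi - lo == 1:
        return vals[lo], H(vals[lo])
    mid = (lo + hi) // 2
    ls, ld = build(vals, lo, mid)
    rs, rd = build(vals, mid, hi)
    return ls + rs, H(ls + rs, ld, rd)

def query(vals, lo, hi, ql, qr, vo):
    """Server: answer sum over [ql, qr) and collect the verification
    object: one (sum, digest) pair per maximal pruned/covered node."""
    if qr <= lo or hi <= ql or (ql <= lo and hi <= qr):
        s, d = build(vals, lo, hi)
        vo.append((s, d))
        return s if ql <= lo and hi <= qr else 0
    mid = (lo + hi) // 2
    return query(vals, lo, mid, ql, qr, vo) + query(vals, mid, hi, ql, qr, vo)

def verify(n, ql, qr, vo, root_digest):
    """Client: replay the recursion shape, recompute the root digest
    from the VO alone, and accept the aggregate only if it matches."""
    it, agg = iter(vo), 0
    def walk(lo, hi):
        nonlocal agg
        if qr <= lo or hi <= ql or (ql <= lo and hi <= qr):
            s, d = next(it)
            if ql <= lo and hi <= qr:
                agg += s
            return s, d
        mid = (lo + hi) // 2
        ls, ld = walk(lo, mid)
        rs, rd = walk(mid, hi)
        return ls + rs, H(ls + rs, ld, rd)
    _, dig = walk(0, n)
    return agg if dig == root_digest else None  # None = tampered result

# One block's values; the client stores only the root digest.
vals = [4, 8, 15, 16, 23, 42, 7, 1]
_, root = build(vals, 0, len(vals))
vo = []
answer = query(vals, 0, len(vals), 2, 6, vo)       # sum of vals[2:6]
assert verify(len(vals), 2, 6, vo, root) == answer == 15 + 16 + 23 + 42
```

The VO here grows logarithmically with the block size, which mirrors the abstract's goal of keeping client-side verification cheap; the paper's accumulator additionally handles multiple selection predicates.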

SrSpark: Skew-resilient Spark based on Adaptive Parallel Processing

Yijie Shen, Jin Xiong and Dejun Jiang

MapReduce-based SQL processing systems, e.g., Hive and Spark SQL, are widely used for big data analytic applications due to automatic parallel processing on large-scale clusters. They provide high processing performance when loads are balanced across the machines. However, skewed loads are not rare in real applications. Although many efforts have been made to address the skew issue in MapReduce-based systems, they can neither fully exploit all available computing resources nor handle skew in SQL processing. Moreover, none of them can expedite the processing of skewed partitions in case of failures. In this paper, we present SrSpark, a MapReduce-based SQL processing system that makes full use of all computing resources for both non-skewed and skewed loads. To achieve this goal, SrSpark introduces fine-grained processing and work-stealing into the MapReduce framework. More specifically, SrSpark is implemented on top of Spark SQL. In SrSpark, partitions are further divided into sub-partitions and processed at sub-partition granularity. Moreover, SrSpark adaptively uses both intra-node and inter-node parallel processing for skewed loads according to the computing resources available in real time. Such adaptive parallel processing increases the degree of parallelism and reduces the interaction overheads among the cooperating worker threads. In addition, SrSpark checkpoints sub-partition processing results periodically to ensure fast recovery from failures during skewed-partition processing. Our experimental results show that for skewed loads, SrSpark outperforms Spark SQL by up to 3.5x, and by 2.2x on average, while the performance overhead is only about 4% under non-skewed loads.
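SrSpark itself lives inside Spark SQL's executors; as a loose, hypothetical sketch of the sub-partition plus work-stealing idea it describes, here is a thread-level Python toy (class and function names are ours):

```python
import random
import threading
from collections import deque

def split(partition, n_sub):
    """Divide one partition into roughly n_sub sub-partitions, the
    finer granularity that makes stealing worthwhile."""
    k = max(1, len(partition) // n_sub)
    return [partition[i:i + k] for i in range(0, len(partition), k)]

class Worker(threading.Thread):
    def __init__(self, peers, process):
        super().__init__()
        self.peers, self.process = peers, process
        self.local, self.lock = deque(), threading.Lock()

    def push(self, sub):
        with self.lock:
            self.local.append(sub)

    def pop_own(self):
        with self.lock:
            return self.local.pop() if self.local else None

    def steal(self):
        for victim in random.sample(self.peers, len(self.peers)):
            if victim is not self:
                with victim.lock:
                    if victim.local:
                        return victim.local.popleft()  # take the oldest
        return None

    def run(self):
        while True:
            sub = self.pop_own() or self.steal()
            if sub is None:
                return  # everything drained (static workload assumed)
            self.process(sub)

# One heavily skewed partition: its sub-partitions start on one worker
# but end up spread across all of them via stealing.
results, workers = [], []
for _ in range(4):
    workers.append(Worker(workers, lambda s: results.append(sum(s))))
for sub in split(list(range(10_000)), 64):
    workers[0].push(sub)
for w in workers:
    w.start()
for w in workers:
    w.join()
```

SrSpark additionally steals across nodes and checkpoints sub-partition results, neither of which this single-process sketch attempts.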

Optimizing Multi-way Theta Join for Data Skew in Sub-Second Stream Computing

Xiaopeng Fan, Xinchun Liu, Yang Wang, Youjun Wang, and Jing Li

In sub-second stream computing, the answer to a complex query usually depends on aggregation or join operations over streams, especially multi-way theta joins. Some attribute keys are not distributed uniformly, e.g., taxi plates in GPS trajectories and transaction records, or stock codes in stock quotes and investment portfolios; this is called the data intrinsic skew problem. In this paper, we define the concept of key redundancy for a single stream as the degree of data intrinsic skew, and joint key redundancy for multi-way streams. We present an execution model for multi-way stream theta joins with a fine-grained cost model to evaluate its performance. We propose a solution named Group Join (GroJoin) to exploit key redundancy during transmission and execution in a cluster. GroJoin adapts to data intrinsic skew through a grouping condition we identify, namely that the selectivity of the theta-join results should be smaller than 25%. Experiments are carried out on multi-way streams produced by our MS-Generator; the simulation results show that GroJoin decreases transmission overheads by up to 45% under different key redundancies and value-key proportionality coefficients, and reduces query delay by up to 70% under different key distributions. We further implement GroJoin for multi-way stream theta joins on Spark Streaming. The experimental results demonstrate that our optimization reduces join latency by about 40%-50% at very small computation cost.
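The abstract does not give the formal definition of key redundancy, so the metric below is only one plausible reading of "degree of data intrinsic skew"; the grouping step then shows why high redundancy lets a GroJoin-style scheme ship each hot key once per batch instead of once per tuple:

```python
from collections import Counter

def key_redundancy(keys):
    """Share of tuples whose key repeats an earlier one: 0.0 when all
    keys are distinct, approaching 1.0 for a single hot key.
    (A plausible formalization; the paper's definition may differ.)"""
    n = len(keys)
    return (n - len(Counter(keys))) / n if n else 0.0

def group_by_key(batch):
    """Batch tuples sharing a key so the key is transmitted once per
    group rather than once per tuple."""
    groups = {}
    for key, value in batch:
        groups.setdefault(key, []).append(value)
    return groups

taxi = [("plate-7", 1), ("plate-7", 2), ("plate-7", 3), ("plate-9", 4)]
print(key_redundancy([k for k, _ in taxi]))  # 0.5: half the tuples repeat a key
print(group_by_key(taxi))                    # {'plate-7': [1, 2, 3], 'plate-9': [4]}
```

Per the abstract, this grouping pays off when the theta-join selectivity stays below roughly 25%, since larger result sets erode the transmission savings.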

Session Chair

Weigang Wu (Sun Yat-sen University)

Session F2

Resource and Data Management

Conference
10:35 AM — 11:55 AM HKT
Local
Dec 2 Wed, 6:35 PM — 7:55 PM PST

OOOPS: An Innovative Tool for IO Workload Management on Supercomputers

Lei Huang and Si Liu

Modern supercomputer applications demand high-performance storage resources in addition to fast computing resources. However, these storage resources, especially parallel shared filesystems, have become the Achilles' heel of many powerful supercomputers. Due to the lack of IO resource provisioning mechanisms on the file server side, a single user's IO-intensive work running on a small number of nodes can overload the metadata server and result in global filesystem performance degradation or even unresponsiveness. To tackle this issue, we developed an innovative tool, the Optimal Overloaded IO Protection System (OOOPS). This tool is designed to control the IO workload from the application side. Supercomputer administrators can easily assign the maximum number of open() and stat() calls allowed per second. OOOPS automatically detects and throttles intensive IO workloads to protect parallel shared filesystems. It also allows supercomputer administrators to dynamically adjust how much metadata throughput one job can utilize while the job runs, without interruption.
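OOOPS intercepts open() and stat() at the library level so applications need no changes; the Python token bucket below is only a sketch of the throttling policy it applies, with an illustrative 100-calls-per-second cap chosen by us:

```python
import os
import threading
import time

class TokenBucket:
    """Admit at most `rate` calls per second; excess callers stall."""
    def __init__(self, rate):
        self.rate, self.tokens = rate, float(rate)
        self.last, self.lock = time.monotonic(), threading.Lock()

    def acquire(self):
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(float(self.rate),
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1.0:
                    self.tokens -= 1.0
                    return
            time.sleep(1.0 / self.rate)  # back off until a token accrues

open_bucket = TokenBucket(rate=100)      # illustrative admin-chosen cap

def throttled_open(path, flags=os.O_RDONLY):
    """Stand-in for an intercepted open(): stalls the caller rather
    than letting an IO storm reach the metadata server."""
    open_bucket.acquire()
    return os.open(path, flags)
```

Raising or lowering `rate` at runtime corresponds to the dynamic, no-interruption adjustment the abstract describes.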

URFS: A User-space Raw File System based on NVMe SSD

Yaofeng Tu, Yinjun Han, Zhenghua Chen, Zhengguang Chen and Bing Chen

NVMe (Non-Volatile Memory Express) is a protocol designed specifically for SSDs (Solid State Drives) that has significantly improved the performance of SSD storage devices. However, the traditional kernel-space IO path hinders the performance of NVMe SSD devices. In this paper, a user-space raw file system (URFS) based on NVMe SSD is proposed. Through a user-space multi-process shared cache, multiple applications can share access to the SSD, reducing the number of SSD accesses; an NVMe-oriented log-free data layout and elastic separation of multi-granularity IO queues further improve system performance and throughput. Experiments show that, compared to traditional file systems, URFS improves performance by more than 23% in CDN (Content Delivery Network) scenarios, with even larger improvements in small-file and read-intensive scenarios.
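URFS's cache is shared across processes via shared memory and sits in front of a user-space NVMe driver; the single-process LRU below sketches only the policy of serving repeated reads from memory so the SSD sees each block once (keys, block size, and capacity are illustrative assumptions of ours):

```python
from collections import OrderedDict

class BlockCache:
    """LRU over (file_id, block_no) -> bytes."""
    def __init__(self, capacity_blocks):
        self.cap, self.blocks = capacity_blocks, OrderedDict()

    def get(self, key, read_from_ssd):
        if key in self.blocks:
            self.blocks.move_to_end(key)     # hit: refresh recency
            return self.blocks[key]
        data = read_from_ssd(key)            # miss: one SSD access...
        self.blocks[key] = data              # ...then later readers hit
        if len(self.blocks) > self.cap:
            self.blocks.popitem(last=False)  # evict the least recent
        return data

cache = BlockCache(capacity_blocks=2)
fetches = []
ssd = lambda key: fetches.append(key) or b"x" * 4096
for key in [("f1", 0), ("f1", 0), ("f1", 1), ("f1", 0)]:
    cache.get(key, ssd)
assert fetches == [("f1", 0), ("f1", 1)]     # repeats never reach the SSD
```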

DyRAC: Cost-aware Resource Assignment and Provider Selection for Dynamic Cloud Workloads

Yannis Sfakianakis, Manolis Marazakis and Angelos Bilas

A primary concern for cloud users is how to minimize the total cost of ownership of cloud services. This is not trivial to achieve due to workload dynamics. Users need to select the number, size, and type of VMs, as well as the provider to host their services, based on available offerings. To avoid the complexity of re-configuring a cloud service, related work commonly approaches cost minimization as a packing problem that minimizes the resources allocated to services. However, this approach does not consider two problem dimensions that can further reduce cost: (1) provider selection and (2) VM sizing. In this paper, we explore a more direct approach to cost minimization by adjusting the type, number, and size of VM instances, as well as the provider of a cloud service (i.e., its deployment), at runtime. Our goal is to identify the limits of service cost reduction by online re-deployment of cloud services. For this purpose, we design DyRAC, an adaptive resource assignment mechanism for cloud services that, given the resource demands of a cloud service, estimates the most cost-efficient deployment. Our evaluation implements four different resource assignment policies to provide insight into how our approach works, using VM configurations from actual offerings of major providers (AWS, GCP, Azure). Our experiments show that DyRAC reduces cost by up to 33% compared to typical strategies.
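A minimal sketch of the re-deployment decision, under two simplifying assumptions of ours: a homogeneous fleet (one VM type per deployment) and made-up offerings and prices, not actual AWS/GCP/Azure rates or DyRAC's policies:

```python
import math

# (vm_type, vCPUs, memory_GiB, $_per_hour) -- illustrative numbers only
OFFERINGS = {
    "providerA": [("small", 2, 4, 0.05), ("large", 8, 16, 0.18)],
    "providerB": [("medium", 4, 8, 0.09), ("xlarge", 16, 32, 0.34)],
}

def cheapest_deployment(cpu_demand, mem_gib_demand):
    """Choose the provider, VM type, and count that cover the current
    demand at the lowest hourly cost; re-evaluated as demand shifts."""
    best = None
    for provider, vms in OFFERINGS.items():
        for name, vcpu, mem, price in vms:
            count = max(math.ceil(cpu_demand / vcpu),
                        math.ceil(mem_gib_demand / mem))
            cost = count * price
            if best is None or cost < best[0]:
                best = (cost, provider, name, count)
    return best

# Demand changes trigger a fresh decision, possibly across providers.
print(cheapest_deployment(6, 10))    # (0.15, 'providerA', 'small', 3)
print(cheapest_deployment(30, 60))   # (0.68, 'providerB', 'xlarge', 2)
```

Even this toy shows the two dimensions the abstract highlights: the best answer can change both VM sizing and provider as demand moves.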

WMAlloc: A Wear-Leveling-Aware Multi-Grained Allocator for Persistent Memory File Systems

Shun Nie, Chaoshu Yang, Runyu Zhang, Wenbin Wang, Duo Liu and Xianzhang Chen

Emerging Persistent Memories (PMs) promise to revolutionize storage systems by providing fast, persistent data access on the memory bus. Persistent memory file systems have therefore been developed to achieve high performance by exploiting the advanced features of PMs. Unfortunately, PMs suffer from limited write endurance, and the existing space management strategies of persistent memory file systems usually ignore this problem, which can cause write operations to concentrate on a few cells of PM. Such unbalanced writes can wear out the underlying PMs quickly, seriously harming the data reliability of the file systems. Meanwhile, existing wear-leveling-aware space management techniques mainly focus on improving wear-leveling accuracy rather than reducing overhead, which can seriously degrade the performance of persistent memory file systems. In this paper, we propose a Wear-Leveling-Aware Multi-Grained Allocator, called WMAlloc, to achieve wear-leveling of PM while improving the performance of persistent memory file systems. WMAlloc adopts multiple heap trees to manage the unused space of PM, with each heap tree representing an allocation granularity. For each allocation, WMAlloc then takes a least-worn block of the required granularity from the corresponding heap tree. We implement the proposed WMAlloc in the Linux kernel based on NOVA, a typical persistent memory file system. Experimental results show that, compared with DWARM, the state-of-the-art wear-leveling-aware space management technique, WMAlloc achieves 1.52× the PM lifetime and 1.44× the performance on average.
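WMAlloc's free space is organized as heap trees over PM, one per granularity; the heapq sketch below is our own simplification showing only the least-worn-first allocation policy, not the in-PM layout:

```python
import heapq

class WearAwareAllocator:
    """One min-heap of free blocks per granularity, keyed by wear count,
    so every allocation hands out a least-worn block."""
    def __init__(self, granularities):
        self.heaps = {g: [] for g in granularities}   # bytes -> heap

    def free(self, granularity, addr, wear):
        heapq.heappush(self.heaps[granularity], (wear, addr))

    def alloc(self, granularity):
        wear, addr = heapq.heappop(self.heaps[granularity])
        return addr, wear          # caller returns it via free(wear + 1)

alloc = WearAwareAllocator([4096, 2 * 1024 * 1024])
for addr in range(0, 8 * 4096, 4096):
    alloc.free(4096, addr, wear=0)       # initially pristine 4 KiB blocks
addr, wear = alloc.alloc(4096)           # least-worn block comes out first
alloc.free(4096, addr, wear + 1)         # wear bumped after the write
```

Keeping one heap per granularity is what makes the allocator "multi-grained": a large request never has to scan small-block metadata, which is where the performance benefit over accuracy-focused schemes comes from.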

Session Chair

Huawei Huang (Sun Yat-sen University)

Session F3

Secure and Reliable Systems

Conference
10:35 AM — 11:55 AM HKT
Local
Dec 2 Wed, 6:35 PM — 7:55 PM PST

Optimizing Complex OpenCL Code for FPGA: A Case Study on Finite Automata Traversal

Marziyeh Nourian, Mostafa Eghbali Zarch and Michela Becchi

While FPGAs have traditionally been considered hard to program, there have recently been efforts aimed at allowing the use of high-level programming models and libraries intended for multi-core CPUs and GPUs to program FPGAs. For example, both Intel and Xilinx now provide toolchains to deploy OpenCL code onto FPGAs. However, because the nature of the parallelism offered by GPU and FPGA devices is fundamentally different, OpenCL code optimized for GPU can prove very inefficient on FPGA, in terms of both performance and hardware resource utilization.
This paper explores this problem on finite automata traversal. In particular, we consider an OpenCL NFA traversal kernel optimized for GPU but exhibiting FPGA-friendly characteristics, namely: limited memory requirements, lack of synchronization, and SIMD execution. We explore a set of structural code changes and custom and best-practice optimizations to retarget this code to FPGA. We showcase the effect of these optimizations on an Intel Stratix V FPGA board using various NFA topologies from different application domains. Our evaluation shows that, while the resource requirements of the original code exceed the capacity of the FPGA in use, our optimizations lead to significant resource savings and allow the transformed code to fit the FPGA for all considered NFA topologies. In addition, our optimizations lead to speedups of up to 4x over an already optimized code variant aimed at fitting the NFA traversal kernel on the FPGA. Some of the proposed optimizations can be generalized for other applications and introduced into OpenCL-to-FPGA compilers.
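The kernel itself is OpenCL; as a language-neutral illustration of the traversal pattern that maps well to SIMD (one bit per NFA state, all live states advanced per input symbol), here is a small bit-parallel sketch in Python. The toy NFA for the pattern "ab" over a two-symbol alphabet is our own example, not one of the paper's topologies:

```python
def traverse(transitions, start_mask, accept_mask, data):
    """Bit i of `active` marks NFA state i live; transitions[sym][state]
    is the successor bitmask. Each input symbol advances all live states
    at once, the data-parallel step a SIMD lane or FPGA pipeline performs."""
    active, matched = start_mask, False
    for sym in data:
        nxt, rest = 0, active
        while rest:                               # visit each live state
            state = (rest & -rest).bit_length() - 1
            nxt |= transitions[sym][state]
            rest &= rest - 1
        active = nxt | start_mask                 # matches may start anywhere
        matched = matched or bool(active & accept_mask)
    return matched

# States: 0 = start, 1 = saw 'a', 2 = accept ("ab" seen). Symbols: 0='a', 1='b'.
transitions = {
    0: {0: 0b010, 1: 0b000, 2: 0b000},   # on 'a': start -> saw-'a'
    1: {0: 0b000, 1: 0b100, 2: 0b000},   # on 'b': saw-'a' -> accept
}
assert traverse(transitions, 0b001, 0b100, [0, 1]) is True        # "ab"
assert traverse(transitions, 0b001, 0b100, [1, 0, 0, 1]) is True  # "..ab"
assert traverse(transitions, 0b001, 0b100, [1, 1, 0]) is False
```

The lack of cross-state synchronization in this loop is precisely the property the paper exploits when retargeting the GPU kernel to FPGA pipelines.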

FastCredit: Expediting Credit-based Proactive Transports in Datacenters

Dezun Dong, Shan Huang, Zejia Zhou, Wenxiang Yang and Hanyi Shi

Recent proposals leverage emerging credit-based proactive transports to achieve high-throughput, low-latency transport in datacenter networks. In particular, transports that employ hop-by-hop credits have the merits of fast convergence, low buffer occupancy, and strong congestion avoidance. However, they treat long flows and latency-sensitive short flows fairly, which increases the transmission latency of short flows and the average flow completion time. Although flow scheduling mechanisms have been studied extensively to accelerate short-flow transmission, they are hard to apply directly in credit-based transports. The root cause is that most traditional flow scheduling mechanisms work on long queues containing flows of various sizes, while credit-based proactive transports maintain extremely short bounded queues, near zero.

Based on this observation, this paper makes the first attempt to accelerate short-flow scheduling in credit-based proactive transports and proposes FastCredit. FastCredit can be used as a general building block to expedite short flows in credit-based proactive transports. In FastCredit, we schedule credit transmission at both receivers and switches to indirectly perform flow scheduling, and develop a mechanism to mitigate credit waste and improve network goodput. Compared to the state-of-the-art credit-based transport protocol, FastCredit reduces average flow completion time to 0.78x and short-flow transmission latency to 0.51x in realistic workloads. In particular, FastCredit reduces average flow completion time to 0.76x under incast circumstances and to 0.62x under many-to-one traffic. Furthermore, FastCredit maintains the advantages of short queues and high throughput.
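The abstract describes credit scheduling at both receivers and switches plus a credit-waste mitigation mechanism; the sketch below is our simplification of just the receiver-side idea, granting each epoch's credit budget shortest-flow-first (the budget and flow sizes are made up):

```python
def allocate_credits(remaining, budget):
    """Grant one epoch's credit budget to flows with the fewest remaining
    packets first (SRPT-like), so short flows finish fast; never grant a
    flow more than it can consume, which avoids wasted credits."""
    grants = {}
    for fid in sorted(remaining, key=remaining.get):  # shortest first
        if budget == 0:
            break
        give = min(remaining[fid], budget)
        grants[fid] = give
        budget -= give
    return grants

flows = {"short-A": 3, "short-B": 5, "long-C": 500}
print(allocate_credits(flows, budget=16))
# {'short-A': 3, 'short-B': 5, 'long-C': 8}: both short flows can finish
# this epoch, while the long flow still absorbs the leftover credits.
```

Because senders in credit-based transports only transmit when granted credits, biasing the credit grants is enough to reorder flows without any long data queue to reshuffle.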

Proactive Failure Recovery for Stateful NFV

Zhenyi Huang and Huawei Huang

Network Function Virtualization (NFV) technology is viewed as a significant component of both fifth-generation (5G) communication networks and edge computing. In this paper, through reviewing the state-of-the-art work on applying NFV to edge computing, we identify an urgent research challenge: providing a proactive failure recovery mechanism for stateful NFV. To realize such proactive failure recovery, we propose a prediction-based algorithm for redeploying stateful NFV instances in real time when network failures occur. The proposed algorithm is based on a relax-and-rounding technique, and its theoretical performance guarantee is analyzed rigorously. Simulation results show that the proposed failure recovery algorithm significantly outperforms reactive baselines in terms of redeployment latency.
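The abstract gives the algorithm only at a high level. As a hypothetical sketch of the rounding stage of a relax-and-rounding approach, assuming unit-demand instances, a fractional solution produced elsewhere by an LP relaxation, and enough total capacity to host every instance:

```python
import random

def round_placement(fractions, capacity):
    """fractions[inst][node] is the LP-relaxed share of instance `inst`
    placed on `node`. Draw each instance's node with those probabilities,
    falling back to the least-loaded feasible node on a capacity miss."""
    load = {n: 0 for n in capacity}
    placement = {}
    for inst, shares in fractions.items():
        nodes = list(shares)
        pick = random.choices(nodes, weights=[shares[n] for n in nodes])[0]
        if load[pick] >= capacity[pick]:
            feasible = [n for n in nodes if load[n] < capacity[n]]
            pick = min(feasible, key=load.get)   # fallback: least loaded
        placement[inst] = pick
        load[pick] += 1
    return placement

fractions = {"vnf1": {"n1": 0.7, "n2": 0.3},
             "vnf2": {"n1": 0.4, "n2": 0.6},
             "vnf3": {"n1": 0.5, "n2": 0.5}}
print(round_placement(fractions, capacity={"n1": 2, "n2": 2}))
```

The paper's objective, constraints, and guarantee analysis are its own; this only illustrates why rounding a fractional placement is fast enough to run when a failure is predicted.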

TEEp: Supporting Secure Parallel Processing in ARM TrustZone

Zinan Li, Wenhao Li, Yubin Xia and Binyu Zang

Machine learning applications are becoming prevalent on various computing platforms, including cloud servers, smartphones, and IoT devices. For these applications, security is one of the most pressing requirements. While trusted execution environments (TEEs) like ARM TrustZone have been widely used to protect critical procedures such as fingerprint authentication and mobile payment, state-of-the-art TEE OS implementations lack support for multi-threading and are not suitable for computing-intensive workloads. This is because current TEE OSes are usually designed for hosting security-critical tasks, which are typically small and not computing-intensive; thus, most TEE OSes do not support multi-threading, in order to minimize the size of the trusted computing base (TCB). In this paper, we propose TEEp, a system that enables multi-threading in TEE without weakening security and supports existing multi-threaded applications running directly in TEE. Our design includes a novel multi-threading mechanism based on cooperation between the TEE OS and the host OS, without trusting the host OS. We implement our system based on OP-TEE and port it to two platforms: a HiKey 970 development board as the mobile platform, and a Huawei Hi1610 ARM server as the server platform. We run TensorFlow Lite on the development board and TensorFlow on the server for performance evaluation in TEE. The results show that our system improves the throughput of TensorFlow Lite on 5 models to 3.2x when 4 cores are available, with 13.5% overhead on average compared with Linux.
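TEEp's mechanism is only outlined in the abstract; the toy simulation below captures just the division of labor it implies: the untrusted host contributes CPU time (threads), while all task state stays inside the TEE. Names and structure are ours, not the paper's:

```python
import queue
import threading

secure_tasks = queue.Queue()   # would live in secure memory in the real design

def secure_entry():
    """Body each host-spawned thread runs after the world switch: pull
    secure work and execute it entirely in the secure world."""
    while True:
        task = secure_tasks.get()
        if task is None:
            return                    # TEE OS tells the thread to leave
        task()                        # runs against TEE-protected state

def tee_request_threads(n):
    """TEE OS asks the untrusted host OS for n threads. The host can
    delay or refuse them (availability depends on it), but it never
    observes the tasks' state (secrecy does not)."""
    workers = [threading.Thread(target=secure_entry) for _ in range(n)]
    for w in workers:
        w.start()
    return workers

partial = []
for shard in range(8):                 # e.g., slices of an inference batch
    secure_tasks.put(lambda s=shard: partial.append(s * s))
workers = tee_request_threads(4)
for _ in workers:
    secure_tasks.put(None)             # one exit token per worker
for w in workers:
    w.join()
print(sorted(partial))                 # [0, 1, 4, 9, 16, 25, 36, 49]
```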

Session Chair

Yu Huang (Nanjing University)
